Abstract

Entity Disambiguation aims to link mentions of ambiguous entities to a
knowledge base (e.g., Wikipedia). Modeling topical coherence is crucial for
this task based on the assumption that information from the same semantic
context tends to belong to the same topic. This paper presents a novel deep
semantic relatedness model (DSRM) based on deep neural networks (DNN) and
semantic knowledge graphs (KGs) to measure entity semantic relatedness for
topical coherence modeling. The DSRM is directly trained on large-scale KGs
and it maps heterogeneous types of knowledge of an entity from KGs to
numerical feature vectors in a latent space such that the distance between
two semantically-related entities is minimized. Compared with the
state-of-the-art relatedness approach proposed by (Milne and Witten,
2008a), the DSRM obtains 19.4% and 24.5% reductions in entity
disambiguation errors on two publicly available datasets respectively.

1 Introduction 


Entity disambiguation is the task of linking mentions of ambiguous entities
to their referent entities in a knowledge base (KB) such as Wikipedia
(Mihalcea and Csomai, 2007).¹ For example, the mentions (e.g., “Detroit”
and “Miami”) in Figure 1 should be linked to entities related to the
National Basketball Association (NBA), such as the basketball teams
“Detroit Pistons” and “Miami Heat”, instead of the cities “Detroit” and
“Miami”.


¹We consider an entity e as a page in Wikipedia or a node in knowledge
graphs, and an entity mention m as an n-gram from a specific natural
language text. In this work, we focus on entity disambiguation and assume
that mentions are given as input (e.g., detected by a named entity
recognition system).

Figure 1: An illustration of the Entity Disambiguation task. Referent
entities are marked in bold.


A crucial piece of evidence for this task is topical coherence, which
assumes that information from the same context tends to belong to the same
topic. For instance, the text in Figure 1 is on a specific topic, NBA
basketball, and we can see that the mentions from this text are also linked
to entities related to this topic. Modeling topical coherence normally
requires defining a measure to capture semantic relatedness between
candidate entities of the mentions from the same context. The standard
relatedness measure widely adopted in existing disambiguation systems
leverages Wikipedia anchor links with Normalized Google Distance (Milne and
Witten, 2008a), and can be formulated as:


$$SR^{mw}(e_i, e_j) = 1 - \frac{\log \max(|E_i|, |E_j|) - \log |E_i \cap E_j|}{\log(|E|) - \log \min(|E_i|, |E_j|)}$$


where $|E|$ is the total number of entities in Wikipedia, and $E_i$ and
$E_j$ are the sets of entities that have links to $e_i$ and $e_j$,
respectively. Our analysis reveals that it generates unreliable relatedness
scores in many cases and tends to be biased towards popular entities. For
instance, it predicts that “NBA” is more semantically-related to the city
“Chicago” than to its basketball team “Chicago Bulls”. This is because
popular entities such as “Chicago” tend to share more common incoming links
with other entities in Wikipedia. Also, an underlying assumption of this
method is that semantically-related entities must share common anchor
links, which is too strong.
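
The anchor-link measure above can be sketched in a few lines. This is only
an illustration under stated assumptions: the function name, the handling of
entity pairs with no shared incoming links (scored 0.0), and the clamping of
the result to a non-negative value are choices made here, not details given
in the text.

```python
import math


def anchor_link_relatedness(links_i, links_j, total_entities):
    """Relatedness from shared incoming links, following the formula above.

    links_i, links_j: sets of entity ids that link to e_i and e_j.
    total_entities:   |E|, the number of entities in the knowledge base.
    """
    common = len(links_i & links_j)
    if common == 0:
        return 0.0  # assumption: no shared incoming links means unrelated
    larger = max(len(links_i), len(links_j))
    smaller = min(len(links_i), len(links_j))
    denominator = math.log(total_entities) - math.log(smaller)
    if denominator <= 0.0:
        return 0.0  # degenerate case: a link set as large as the whole KB
    distance = (math.log(larger) - math.log(common)) / denominator
    return max(0.0, 1.0 - distance)  # assumption: clamp negative scores to 0
```

Because the score depends only on the overlap of incoming links, two
entities with no shared links always receive the lowest score, which is the
strong assumption criticized above.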

To address these limitations, we propose a novel deep semantic relatedness
model (DSRM) that leverages semantic knowledge graphs (KGs) and deep neural
networks (DNN). In the past decade, tremendous efforts have been made to
construct many large-scale structured and linked KGs (e.g., Freebase² and
DBpedia³), which store a huge amount of clean and important knowledge about
entities, from contextual and typed information to structured facts. Each
fact is represented as a triple connecting a pair of entities by a certain
relationship, of the form {left entity, relation, right entity}. An example
for the entity “Miami Heat” in Freebase is shown in Figure 2. These
semantic KGs are valuable resources to enhance relatedness measurement and
deep understanding of entities.


The Miami Heat are an American professional basketball team based in Miami, 
Florida. The team is a member of the Southeast Division in the Eastern 
Conference of the National Basketball Association. They play their home 
games at the American Airlines Arena in Downtown Miami. The team owner is 
Micky Arison, who also owns cruise-ship giant Carnival Corporation. 



Figure 2: An example of Freebase. Nodes represent entities such as “Miami
Heat”, and edges represent semantic relations such as “Coach” and
“Location”. Each entity is also provided with a textual description and
entity types.


Low-dimensional representations (i.e., distributed representations) of
objects (e.g., words, documents, and entities) have shown remarkable
success in the fields of Natural Language Processing (NLP) and information
retrieval due to their ability to capture the latent semantics of objects
(Collobert et al., 2011; Huang et al., 2013). Deep learning techniques have
been applied successfully to learn distributed representations since they
can extract hidden semantic features with hierarchical architectures and
map objects into a latent space (e.g., (Bengio et al., 2003; Collobert et
al., 2011; Huang et al., 2013; Bordes et al., 2013; Socher et al., 2013)).
Motivated by the previous work, we propose to learn latent semantic entity
representations with deep learning techniques to enhance entity relatedness
measurement. We directly encode heterogeneous types of semantic knowledge
from KGs, including structured knowledge (i.e., entity facts and entity
types) and textual knowledge (i.e., entity descriptions), into DNN. By
automatically mining a large amount of training instances from KGs and
Wikipedia, we then train the neural network models discriminatively in a
supervised fashion such that the distances between semantically-related
entities are minimized in a latent space. In this way, the neural networks
can be optimized directly for the entity relatedness task and capture
semantics in this dimension. Therefore, compared to the standard approach
proposed by (Milne and Witten, 2008a), our proposed DSRM is by nature a
deep semantic model that can capture the latent semantics of entities.
Another advantage is that it can capture semantic relatedness between
entities that do not share any common anchor links.

²https://www.freebase.com/

³http://www.dbpedia.org/

The main contributions of this paper are summarized as follows:

• We propose a novel deep semantic relatedness model based on DNN and
  semantic KGs to measure entity semantic relatedness.

• We explore heterogeneous types of semantic knowledge from KGs and show
  that semantic KGs are better resources than Wikipedia anchor links to
  measure entity relatedness.

• By conducting extensive experiments on publicly available datasets from
  both news and tweets, we show that the proposed DSRM significantly
  outperforms several competitive baseline approaches regarding both
  relatedness measurement and entity disambiguation quality.


2 Related Work 


Measuring semantic similarity or relatedness between words, phrases, and
entities has many applications in NLP, such as the entity disambiguation
task studied in this work. Existing approaches mainly leveraged some
classic similarity measures that do not utilize semantics or topic models
and they were built on top of a thesaurus (e.g., WordNet) or Wikipedia
(McHale, 1998; Landauer et al.; Strube and Ponzetto, 2006; Gabrilovich and
Markovitch, 2007; Milne and Witten, 2008a; Ceccarelli et al., 2013). In
contrast, we leverage both structured and contextual information from
large-scale semantic KGs and deep semantic models to measure entity
relatedness.


Most existing entity disambiguation methods considered entity relatedness
as crucial evidence, from non-collective approaches that resolve one
mention at a time (Mihalcea and Csomai, 2007; Milne and Witten, 2008b; Guo
et al., 2013) to collective approaches that leverage the global topical
coherence for joint disambiguation (Cucerzan, 2007; Kulkarni et al., 2009;
Han et al., 2011; Ratinov et al., 2011; Hoffart et al., 2011; Cassidy et
al., 2012; Shen et al., 2013; Liu et al., 2013; Huang et al., 2014). Huang
et al. (2014) proposed a collective approach based on semi-supervised graph
regularization and achieved the state-of-the-art performance for tweets. To
study the impact of our proposed DSRM on entity disambiguation, we adapt
their approach and develop an unsupervised approach to model topical
coherence for both news and tweets.

This work is highly related to distributed representation learning of
textual objects such as words, phrases, and documents with deep learning
techniques (e.g., (Bengio et al., 2003; Mnih and Hinton, 2007; Hinton and
Salakhutdinov, 2010; Collobert et al., 2011; Bordes et al., 2012; Huang et
al., 2012; Bordes et al., 2013; Socher et al., 2013; Mikolov et al., 2013;
He et al., 2013; Huang et al., 2013; Shen et al., 2014; Yih et al., 2014;
Gao et al., 2014)). Among the above work, Huang et al. (2013) is the most
relevant to ours. We extend their work to large-scale semantic KGs by
leveraging both structured and contextual knowledge for semantic
representation learning of entities. Then we apply the approach to model
topical coherence for entity disambiguation, as opposed to Web search. He
et al. (2013) first explored deep learning techniques to measure local
context similarity for entity disambiguation. This work complements theirs
since we aim to measure entity relatedness for global topical coherence
modeling.

Semantic KGs have been demonstrated to be useful resources for external
knowledge mining for entity and relation extraction (Hakkani-Tür et al.,
2013; Heck et al., 2013) and coreference and [...] ([...] and Weikum,
2015). Some recent work learned distributed representations for entities
directly from KGs for semantic parsing (Bordes et al., 2012), link
prediction (Bordes et al., 2013; Socher et al., 2013; Wang et al., 2014;
Lin et al., 2015), and question answering (Bordes et al., 2014; Yang et
al., 2014). Our work leverages KGs to learn entity representations to
measure entity relatedness, which is different from the above problems.


3 A Deep Semantic Relatedness Model 
(DSRM) 

In order to measure entity relatedness for topical coherence modeling, we
propose to learn latent semantic entity representations that capture the
latent semantics of entities. To learn entity representations, we directly
encode various semantic knowledge from KGs into DNN.

3.1 The DSRM Architecture

Figure 3: The DSRM architecture.

The architecture of the DSRM is shown in Figure 3. The knowledge
representations of an entity from KGs are shown in the two bottom layers
(Feature Vector and Word Hashing). Following (Huang et al., 2013), we adopt
the letter-n-gram based word hashing technique to reduce the dimensionality
of the bag-of-words term vectors. This is because the vocabulary size of
large-scale KGs is often very large (e.g., more than 4 million Wikipedia
entities and 1 million bag-of-words terms exist in Wikipedia), which makes
the “one-hot” vector representation very expensive. Word hashing, in
contrast, can dramatically reduce the vector dimensionality to a constant
small size (e.g., 50k). It can also handle out-of-vocabulary words and
newly created entities. The specific approach we use is based on letter
tri-grams. For instance, the word “cat” can be split into letter tri-grams
(#ca, cat, at#) by first adding start and end marks to the word (e.g.,
#cat#). We then use a vector of letter tri-grams to represent the word. In
particular, we leverage four types of knowledge from KGs to represent each
entity e, which are described in detail as follows:


• Connected Entities E: the set of connected entities of e. For instance,
  as shown in Figure 2, E = {“Erik Spoelstra”, “Miami”, “NBA”, “Dwyane
  Wade”} for “Miami Heat”. For each $e_i \in E$, we generate its surface
  form and represent it as a bag-of-words. The word hashing layer then
  transforms each word into a letter tri-gram vector.

• Relations R: the set of relations that e holds. For example, R =
  {“Coach”, “Location”, “Founded”, “Member”, “Roster”} for “Miami Heat” in
  Figure 2. Each relation in R is represented as a binary “one-hot” vector
  (e.g., [0, ..., 0, 1, 0, ..., 0]).

• Entity Types ET: the set of attached entity types for e. ET =
  {“professional sports team”} for “Miami Heat”. We represent each entity
  type as a binary “one-hot” vector. We do not adopt word hashing to break
  down relations and entity types because their sizes are relatively small
  (i.e., 3.2k relations and 1.6k entity types).

• Entity Description D: the textual description of an entity. The
  description provides a concise summary of salient information about e.
  For instance, from the description of “Miami Heat”, we can learn
  important information such as its role, location, and founder. The
  description is represented as a bag-of-words, which is then transformed
  by the word hashing layer into letter tri-gram vectors.
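
The two input encodings described above (letter tri-gram hashing for
textual fields, “one-hot” vectors for relations and entity types) can be
illustrated with a minimal sketch. The function names, the lowercasing step
and the whitespace tokenization are assumptions made for this example; the
text does not specify them.

```python
from collections import Counter


def letter_trigrams(word):
    """Split a word into letter tri-grams after adding '#' boundary marks."""
    marked = "#" + word.lower() + "#"  # lowercasing is an assumption
    return [marked[i:i + 3] for i in range(len(marked) - 2)]


def hash_bag_of_words(text):
    """Map a surface form or description to letter tri-gram counts."""
    counts = Counter()
    for word in text.split():  # assumption: whitespace tokenization
        counts.update(letter_trigrams(word))
    return counts


def one_hot(item, vocabulary):
    """Binary one-hot vector of `item` over a small, fixed vocabulary."""
    return [1 if entry == item else 0 for entry in vocabulary]
```

For example, `letter_trigrams("cat")` yields `["#ca", "cat", "at#"]`,
matching the example in the text, and an unseen word still maps onto
tri-grams that are already known, which is why out-of-vocabulary words can
be handled.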

On top of the word hashing layer, we have multiple hidden layers to perform
non-linear transformations, which allow the DNN to learn useful semantic
features by performing back propagation with respect to an objective
function designed for the entity relatedness task. Finally, we can obtain
the semantic representation y for e from the top layer. Denoting x as the
input feature vector of e, y as the output semantic vector of e, N as the
number of layers, $l_i, i = 1, \ldots, N-1$ as the output vectors of the
intermediate hidden layers, and $W_i$ and $b_i$ as the weight matrix and
bias term of the i-th layer respectively, we can formally present the DSRM
as:

$$l_1 = W_1 x$$
$$l_i = f(W_i l_{i-1} + b_i), \quad i = 2, \ldots, N-1$$
$$y = f(W_N l_{N-1} + b_N)$$

where we use tanh as the activation function at the output layer and the
intermediate hidden layers. Specifically,
$f(x) = \tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}$.

After we obtain the semantic representations for entities $e_i$ and $e_j$,
we use cosine similarity to measure their relatedness as

$$SR^{DSRM}(e_i, e_j) = \frac{y_{e_i}^{T} y_{e_j}}{\|y_{e_i}\| \, \|y_{e_j}\|}$$

where $y_{e_i}$ and $y_{e_j}$ are the semantic representations of $e_i$ and
$e_j$, respectively.

3.2 Learning the DSRM

Training Data Mining: In order to train a DSRM that captures semantics
specific to the entity relatedness task, we first automatically mine
training data based on KGs and Wikipedia anchor links. Beyond using linked
entity pairs from KGs as positive training instances, we also mine more
training data (especially negative instances) from Wikipedia. Suppose $t_i$
is an anchor text from a Wikipedia article that is linked to an entity
$e_i$, and $t_j$ is an anchor text within a window of 150 characters of
$t_i$ that is linked to an entity $e_j$. Then we consider $(e_i, e_j)$ as a
positive training instance. To obtain negative training instances for
$e_i$, we randomly sample 5 other candidate entities of $t_j$ (denoted as
$E_j$), and consider each $(e_i, e'_j)$ as a negative training instance for
each $e'_j \in E_j$. Similarly, we obtain negative training instances for
$e_j$. In this way, we finally obtain about 20 million positive training
pairs and 200 million negative training pairs. By mining the training
instances automatically, we can train the DSRM without manual labels and
save tremendous human annotation effort.

Model Training: Following (Collobert et al., 2011; Huang et al., 2013; Shen
et al., 2014), we formulate a loss function as:

$$\mathcal{L}(\Lambda) = -\log \prod_{(e, e^{+})} P(e^{+} \mid e)$$

where $\Lambda$ denotes the set of parameters of the DSRM, and $e^{+}$ is a
semantically-related entity of e. $P(e_j \mid e_i)$ is the posterior
probability of entity $e_j$ given $e_i$ through the softmax function:

$$P(e_j \mid e_i) = \frac{\exp(\gamma \, SR^{DSRM}(e_i, e_j))}{\sum_{e' \in E_i} \exp(\gamma \, SR^{DSRM}(e_i, e'))}$$

where $\gamma$ is the smoothing parameter, which is determined based on a
held-out set, and $E_i$ is the set of related or non-related entities of
$e_i$ in the training data.
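
The forward pass, the cosine relatedness and the softmax-based loss
described in this section can be sketched together in plain Python. This is
a toy illustration rather than the trained model: the function names, the
list-of-lists matrix layout and the convention that `biases` holds the bias
terms of layers 2 to N only are assumptions made here.

```python
import math


def forward(x, weights, biases):
    """Compute y: a linear first layer, then tanh layers 2..N.

    weights: [W_1, ..., W_N] as lists of rows.
    biases:  [b_2, ..., b_N] (assumed layout; the first layer has no bias).
    """
    def matvec(matrix, vector):
        return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

    layer = matvec(weights[0], x)
    for matrix, bias in zip(weights[1:], biases):
        layer = [math.tanh(z + b) for z, b in zip(matvec(matrix, layer), bias)]
    return layer


def cosine(u, v):
    """Cosine similarity between two semantic vectors."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def posterior(scores, target, gamma):
    """Softmax over relatedness scores, sharpened by the smoothing factor."""
    exps = [math.exp(gamma * s) for s in scores]
    return exps[target] / sum(exps)


def pair_loss(scores, target, gamma):
    """Negative log posterior of the related entity at index `target`."""
    return -math.log(posterior(scores, target, gamma))
```

Here `scores` would hold the cosine relatedness between an entity and each
of its related or non-related entities; minimizing `pair_loss` pushes the
score of the related entity above the others.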

To obtain the optimal solution, we need to minimize the above loss
function. In order to avoid over-fitting, we determine model parameters
with cross validation by randomly splitting the mined concept pairs into
two sets: training and validation sets. We set the number of hidden layers
as 2 and the number of units in each hidden layer and output layer as 300.
Following (Huang et al., 2013), we initialize each weight matrix $W_i$ with
a uniform distribution:

$$W_i \sim \left[ -\sqrt{\frac{6}{|l_{i-1}| + |l_i|}}, \; \sqrt{\frac{6}{|l_{i-1}| + |l_i|}} \right]$$

where $|l|$ is the size of the vector l. Then we train the model with
mini-batch based stochastic gradient descent,⁴ and the training normally
converges after 20 epochs in our experiments. We set the mini-batch size of
training instances as 1024. It takes roughly 72 hours to finish the model
training on an NVidia Tesla K20 GPU machine.


4 Topical Coherence Modeling with 
Unsupervised Graph Regularization 

There exist many approaches to model topical coherence to enhance entity
disambiguation. A recent approach proposed by (Huang et al., 2014)
leveraged a semi-supervised graph regularization model to perform
collective inference over multiple tweets and achieved the state-of-the-art
performance. In this work, we adapt their approach and develop a completely
unsupervised framework which is more suitable for the entity disambiguation
task. This is because it is challenging to obtain manually labeled seeds
for new and unseen data (e.g., a news document).

4.1 Relational Graph Construction 



Figure 4: A portion of the relational graph constructed for the example in
Figure 1. The entities marked in bold are the referent entities for the
mentions in the same node.

We first construct a relational graph G = (V, E),⁵ where V is a set of
nodes and E is a set of edges. Each node $v_i = (m_i, e_i)$ contains a pair
of mention $m_i$ and its entity candidate $e_i$. Each node $v_i$ is also
associated with a ranking score $r_i$ indicating the probability of $e_i$
being the referent entity of $m_i$.

⁴Due to space limitation, we do not derive the derivatives of the loss
function here. Readers can refer to (Collobert et al., 2011; Huang et al.,
2013) for more details.

⁵In this work, we choose not to incorporate the set of local features
adopted in (Huang et al., 2014) since they are mainly designed for mention
detection instead of disambiguation.


A weighted edge is added between two nodes $v_i$ and $v_j$ if and only if
(i) $m_i$ and $m_j$ are relevant, (ii) $e_i$ and $e_j$ are
semantically-related, and (iii) $v_j$ is one of the k nearest neighboring
nodes of $v_i$.⁶ Let W be the weight matrix of the relational graph G. Then
if $v_i$ and $v_j$ satisfy the above three conditions, the weight of their
connecting edge is computed as $W_{ij} = SR(e_i, e_j)$, where
$SR(e_i, e_j)$ is a relatedness measure. Otherwise, $W_{ij}$ is set as 0.
To determine whether $m_i$ and $m_j$ are relevant or not in tweets, we
follow (Huang et al., 2014) to check their connectivity in a network
constructed from social relations such as authorship and #hashtag. For
news, we only model topical coherence within one single document ($m_i$ and
$m_j$ are relevant if they are from the same document). This is because
each news document normally has relatively rich contextual information, and
we do not have adequate social relations such as authorship relations in
the dataset. An example relational graph is shown in Figure 4.
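
The three edge conditions above can be sketched as follows. The function
and parameter names are hypothetical, `relevant` and `relatedness` are
caller-supplied callables, and treating a score greater than zero as
“semantically-related” is an assumption of this sketch, since the text does
not state a threshold.

```python
def build_weight_matrix(nodes, relatedness, relevant, k=20):
    """Build W for nodes given as (mention, entity) pairs.

    relatedness(e_i, e_j) -> float and relevant(m_i, m_j) -> bool are
    supplied by the caller; k is the neighborhood size.
    """
    n = len(nodes)
    weights = [[0.0] * n for _ in range(n)]
    for i, (m_i, e_i) in enumerate(nodes):
        scored = []
        for j, (m_j, e_j) in enumerate(nodes):
            if i == j or not relevant(m_i, m_j):
                continue  # condition (i): mentions must be relevant
            score = relatedness(e_i, e_j)
            if score > 0.0:  # condition (ii), assumed to mean a positive score
                scored.append((score, j))
        # condition (iii): keep only the k most related neighbors of node i
        for score, j in sorted(scored, reverse=True)[:k]:
            weights[i][j] = score
    return weights
```

Note that a k-nearest-neighbor rule of this kind can produce an asymmetric
matrix, since $v_j$ may be among the nearest neighbors of $v_i$ without the
converse holding.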

4.2 Ranking Score Initialization and 
Automatic Labeled Seed Mining 

We initialize the ranking score of each node based on a sub-system of AIDA
(Hoffart et al., 2011), which relies on the linear combination of prior
popularity and context similarity. The prior popularity is measured based
on the frequency of Wikipedia anchor links. The context similarity proposed
in AIDA is computed based on the extracted keyphrases (e.g., Wikipedia
anchor texts) of an entity and all of their partial matches in the text of
a mention.

We also adopt two heuristics to mine a set of labeled seed nodes for the
graph regularization model: (i) If a node v contains an unambiguous
mention, then v is selected as a seed node and it has an initial ranking
score of 1.0. (ii) For a mention m with the top ranked candidate entity by
prior popularity as e, if the prior popularity p(e|m) of e satisfies
p(e|m) > 0.95 and e is also the top ranked entity by context similarity,
then all nodes related to m are selected as labeled seeds. The node
v = (m, e) will be assigned a ranking score of 1.0, and other nodes will be
assigned a ranking score of 0. During the graph regularization process, the
ranking scores of these labeled seed nodes will remain unchanged.
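
The two seed-mining heuristics can be expressed compactly. The data layout
below (dictionaries keyed by mention and by (mention, entity) pairs) and
the function name are assumptions for illustration only.

```python
def mine_seeds(candidates, prior, context_sim, threshold=0.95):
    """Return {(mention, entity): fixed ranking score} for seed nodes.

    candidates:  {mention: [candidate entities]}
    prior:       {(mention, entity): prior popularity p(e|m)}
    context_sim: {(mention, entity): context similarity}
    """
    seeds = {}
    for mention, entities in candidates.items():
        if not entities:
            continue
        if len(entities) == 1:
            # Heuristic (i): an unambiguous mention is a seed with score 1.0.
            seeds[(mention, entities[0])] = 1.0
            continue
        top_prior = max(entities, key=lambda e: prior[(mention, e)])
        top_context = max(entities, key=lambda e: context_sim[(mention, e)])
        # Heuristic (ii): dominant prior that agrees with context similarity.
        if prior[(mention, top_prior)] > threshold and top_prior == top_context:
            for entity in entities:
                seeds[(mention, entity)] = 1.0 if entity == top_prior else 0.0
    return seeds
```

Mentions that satisfy neither heuristic contribute no seeds, so their nodes
keep their initial scores and are refined by the graph regularization step.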


⁶We set k as 20, which is obtained from a development set.






































4.3 Graph Regularization 

We then utilize the graph regularization model to refine the ranking scores
of unlabeled nodes simultaneously. We denote the first l nodes as seed
nodes with initial ranking scores $R_l$ (l = 0 is possible in some cases if
our approach fails to find any labeled seeds), the remaining u nodes
(u = n − l) as initialized with ranking scores $R_u$, and W as the weight
matrix of the relational graph G. Then the graph regularization framework
(Zhu et al., 2003; Huang et al., 2014) can be formulated as

$$\mathcal{Q}(\mathcal{R}) = \mu \sum_{i=l+1}^{n} (r_i - r_i^{0})^2 + \frac{1}{2} \sum_{i,j} W_{ij} (r_i - r_j)^2$$

where $r_i^{0}$ is the initial ranking score of node $v_i$ and $\mu$ is a
regularization parameter that controls the trade-off between initial
rankings and smoothness over the graph structure. This loss function aims
to ensure the constraint that two strongly connected nodes should have
similar ranking scores. There exist both closed-form and iterative
solutions for the above optimization problem since
$\mathcal{Q}(\mathcal{R})$ is convex (Zhu et al., 2003; Huang et al.,
2014).⁷
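
One simple way to minimize an objective of this form is coordinate-wise
iteration. The sketch below is not the solver used in the cited work; it
assumes a symmetric weight matrix with a zero diagonal, a positive
regularization parameter, and a set of seed indices whose scores stay
fixed. Setting the derivative with respect to a single unlabeled score to
zero gives the update in the inner loop.

```python
def regularize(weights, initial, seeds, mu=0.8, iterations=100):
    """Refine ranking scores by repeated coordinate-wise minimization.

    weights: symmetric matrix W with zero diagonal (assumption).
    initial: initial ranking scores, one per node.
    seeds:   set of node indices whose scores must stay unchanged.
    """
    scores = list(initial)
    n = len(scores)
    for _ in range(iterations):
        for i in range(n):
            if i in seeds:
                continue  # seed scores are held fixed
            degree = sum(weights[i])
            neighbor_sum = sum(weights[i][j] * scores[j] for j in range(n))
            # Minimizer of the objective in scores[i] with the others fixed.
            scores[i] = (mu * initial[i] + neighbor_sum) / (mu + degree)
    return scores
```

A larger `mu` keeps each score closer to its initial value, while a smaller
one lets strongly connected neighbors pull scores together.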


5 Experiments 

In this section, we evaluate the performance of 
various semantic relatedness methods and their 
impact on the entity disambiguation task. 

5.1 Data and Scoring Metric 

For our experiments we use Wikipedia as our knowledge base, which
originally contains 30 million entities. To reduce noise, we remove the
entities which have fewer than 5 incoming anchor links and obtain 4 million
entities. We use a portion of Freebase limited to Wikipedia entities as the
semantic KG, with detailed statistics shown in Table 1. To evaluate the
quality of entity relatedness, we use a benchmark test set created by
(Ceccarelli et al., 2013) from CoNLL 2003 data. It includes 3,314 entities
as testing queries and each query has 91 candidate entities on average to
measure relatedness. After obtaining the ranked orders of candidate
entities for these queries, we compute the nDCG (Järvelin and Kekäläinen,
2002) and mean average precision (MAP) (Manning et al., 2008) scores to
evaluate the relatedness measurement quality.
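
For readers unfamiliar with the two ranking metrics, a minimal sketch
follows. Several nDCG variants exist; this one assumes linear gains with a
log2 rank discount and binary relevance for average precision, which may
differ in detail from the evaluation scripts used for the benchmark.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k ranked gains."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(gains, k):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def average_precision(relevance):
    """Average of precision values at the ranks of relevant items."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of average precision over all queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)
```

Each query entity contributes one ranked list of candidates; the reported
scores would then be averages over all queries.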

To evaluate disambiguation performance, we use two public test sets
composed of both news documents and tweets: (i) AIDA (Hoffart et al., 2011)
is a news dataset based on CoNLL 2003 data. It includes 131 documents and
4,485 non-NIL mentions. (ii) A tweet set released by (Meij et al., 2012),
which includes 502 tweets and 812 non-NIL mentions. We follow the previous
work (Cucerzan, 2007) and leverage Wikipedia anchor links to construct a
mention-entity dictionary for candidate generation. For efficiency, we only
consider the top 30 ranked entities by prior popularity in our systems. We
compute both standard micro (aggregates over all mentions) and macro
(aggregates over all documents) precision scores over the top ranked
candidate entities for disambiguation performance evaluation.

⁷Readers can refer to (Zhu et al., 2003; Huang et al., 2014) for the
derivation of the closed-form and iterative solutions.

Knowledge Graph Element    Size
# Entities                 4.12m
# Relations                3.17k
# Entity Types             1.57k

Table 1: Statistics of Freebase KG.
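
The difference between the micro and macro scores can be made concrete with
a short sketch. The input layout (a dictionary from document id to
(predicted, gold) entity pairs for the top ranked candidates) is an
assumption for illustration.

```python
def micro_macro_precision(results):
    """Return (micro, macro) precision of top-ranked predictions.

    results: {doc_id: [(predicted_entity, gold_entity), ...]}
    """
    total_correct = 0
    total_mentions = 0
    per_document = []
    for pairs in results.values():
        if not pairs:
            continue  # documents without mentions are skipped
        correct = sum(1 for predicted, gold in pairs if predicted == gold)
        total_correct += correct
        total_mentions += len(pairs)
        per_document.append(correct / len(pairs))
    micro = total_correct / total_mentions if total_mentions else 0.0
    macro = sum(per_document) / len(per_document) if per_document else 0.0
    return micro, macro
```

Micro precision weights every mention equally, so long documents dominate,
whereas macro precision weights every document equally.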


5.2 Experimental Settings 

The semantic relatedness measures we study in this work are summarized in
Table 2. We combine these measures with the unsupervised graph
regularization model (GraphRegu) and develop an unsupervised collective
inference framework for entity disambiguation. We compare our methods with
several state-of-the-art entity disambiguation approaches as follows (the
first three methods were developed for news, and TagMe and Meij were
proposed for tweets):

• Kul_sp: This is a collective approach with integer linear programs
  (Kulkarni et al., 2009).

• Shirak: This approach utilizes a probabilistic taxonomy with the Naive
  Bayes model (Shirakawa et al., 2011).

• AIDA: This is a graph-based collective approach which finds a dense
  subgraph for joint disambiguation (Hoffart et al., 2011).

• TagMe: This approach determines the referent entity by computing the sum
  of weighted average semantic relatedness scores between entities
  (Ferragina and Scaiella, 2010).

• Meij: A supervised approach based on the random forest model (Meij et
  al., 2012).

• GraphRegu + a relatedness measure: This is the collective inference
  framework we develop by combining GraphRegu with a relatedness measure
  for the relational graph construction. Note in particular that GraphRegu
  + M&W can be considered as a baseline approach we adapt from (Huang et
  al., 2014).
















































Methods      Descriptions
M&W          The Wikipedia anchor link-based method proposed by
             (Milne and Witten, 2008a).
DSRM_1       Our proposed DSRM based on connected entities.
DSRM_12      DSRM_1 + relations.
DSRM_123     DSRM_12 + entity types.
DSRM_1234    DSRM_123 + entity descriptions.

Table 2: Description of semantic relatedness methods.


Methods      nDCG@1    nDCG@5    nDCG@10    MAP
M&W          0.54      0.52      0.55       0.48
DSRM_1       0.68      0.61      0.62       0.56
DSRM_12      0.72      0.64      0.65       0.59
DSRM_123     0.74      0.65      0.66       0.61
DSRM_1234    0.81      0.73      0.74       0.68

Table 3: Overall performance of entity semantic relatedness methods.


Next, we study the relatedness measurement quality of various relatedness
methods (Section 5.3), and the impact of relatedness methods on entity
disambiguation (Section 5.4).


5.3 Quality of Semantic Relatedness 
Measurement 

The overall performance of various relatedness methods is shown in Table 3.
We can see that the DSRM significantly outperforms the standard relatedness
method M&W (p < 0.05, according to the Wilcoxon Matched-Pairs Signed-Ranks
Test), indicating that deep semantic models based on semantic KGs are more
effective for relatedness measurement. As we incorporate more types of
knowledge into the DSRM, it achieves better relatedness quality, showing
that the four types of semantic knowledge complement each other.

To study the main differences between M&W and the DSRM, we also show some
examples of relatedness scores in Tables 4 and 5. From both tables, we can
see that M&W predicts that “NBA” and “NFL” are more semantically-related to
cities/states than to their professional teams. However, the DSRM produces
more reasonable scores to indicate that these sports teams are highly
semantically-related to their association. We can also see that M&W tends
to generate high relatedness scores for popular entities (e.g., “New York
City” and “Philadelphia”), but the DSRM does not have such a bias.


5.4 Impact on Entity Disambiguation

The regularization parameter μ of the GraphRegu is set as 0.8 for both
datasets, obtained from a held-out set from CoNLL 2003 data. The overall
disambiguation performance is shown in Tables 6 and 7 for the AIDA dataset
and the tweet set, respectively. Compared with other strong baseline
approaches, our developed unsupervised approach GraphRegu + M&W adapted
from (Huang et al., 2014) achieves very competitive performance for both
datasets, illustrating that the GraphRegu is effective to model topical
coherence for entity disambiguation.

Methods               M&W     DSRM_1234
New York City         0.90    0.22
New York Knicks       0.79    0.79
Boston                0.75    0.27
Boston Celtics        0.58    0.77
Dallas                0.69    0.35
Dallas Mavericks      0.52    0.74
Philadelphia          0.81    0.27
Philadelphia 76ers    0.58    0.85

Table 4: Examples of relatedness scores between a sample of entities and
the entity “NBA”.

Methods                M&W     DSRM_1234
New York City          0.89    0.09
New York Jets          0.92    0.63
Boston                 0.92    0.19
Boston Bruins          0.62    0.38
Dallas                 0.87    0.34
Dallas Cowboys         0.72    0.68
Philadelphia           0.93    0.19
Philadelphia Eagles    0.79    0.65

Table 5: Examples of relatedness scores between a sample of entities and
the entity “National Football League”.

Group         Method                   micro P@1.0    macro P@1.0
Baseline      Kul_sp                   72.87          76.74
Baseline      Shirak                   81.40          83.57
Baseline      AIDA                     82.29          82.02
Our Methods   GraphRegu + M&W          82.23          81.10
Our Methods   GraphRegu + DSRM_1       84.17          83.30
Our Methods   GraphRegu + DSRM_12      85.33          83.94
Our Methods   GraphRegu + DSRM_123     84.91          83.56
Our Methods   GraphRegu + DSRM_1234    86.58          85.47

Table 6: Overall disambiguation performance (%) on AIDA dataset.

Group         Method                   micro P@1.0    macro P@1.0
Baseline      TagMe                    61.03          60.46
Baseline      Meij                     68.33          69.19
Our Methods   GraphRegu + M&W          65.13          66.20
Our Methods   GraphRegu + DSRM_1       69.19          69.04
Our Methods   GraphRegu + DSRM_12      70.22          69.61
Our Methods   GraphRegu + DSRM_123     71.50          70.92
Our Methods   GraphRegu + DSRM_1234    71.89          71.72

Table 7: Overall disambiguation performance (%) on tweet set.

Our best system based on the DSRM with all four types of knowledge (denoted
as DSRM_1234) significantly outperforms various strong baseline competitors
for both datasets (all with p < 0.05). In particular, compared with the
standard method M&W, DSRM_1234 achieves 24.5% and 19.4% relative reductions
in disambiguation errors for news and tweets, respectively. For instance,
GraphRegu + M&W fails to disambiguate the mention “Middlesbrough” to the
football club “Middlesbrough F.C.” in the text “Lee Bowyer was expected to
play against Middlesbrough on Saturday.”. This is because M&W generates the
same semantic relatedness score (0.39) for (“Middlesbrough F.C.”, “Lee
Bowyer”) and (“Middlesbrough”, “Lee Bowyer”). However, DSRM_1234 computes
the relatedness score for the former pair as 0.68, much higher than the
score 0.33 of the latter one, thus GraphRegu + DSRM_1234 correctly
disambiguates the mention.
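
The relative error reductions quoted above appear to follow from the micro
P@1.0 scores of GraphRegu + M&W and GraphRegu + DSRM 1234 in the two
disambiguation tables, taking the error rate as 100 minus precision. This
reading is an assumption, but the arithmetic matches the reported figures:

```latex
\begin{align*}
\text{News (AIDA):}\quad
  \frac{(100 - 82.23) - (100 - 86.58)}{100 - 82.23}
  &= \frac{17.77 - 13.42}{17.77} = \frac{4.35}{17.77} \approx 24.5\% \\
\text{Tweets:}\quad
  \frac{(100 - 65.13) - (100 - 71.89)}{100 - 65.13}
  &= \frac{34.87 - 28.11}{34.87} = \frac{6.76}{34.87} \approx 19.4\%
\end{align*}
```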


5.5 Discussion 


In this subsection, we aim to answer two questions: (i) Are semantic KGs
better resources than Wikipedia anchor links for relatedness measurement?
(ii) Is the DNN a better choice than Normalized Google Distance (NGD)
(Cilibrasi and Vitanyi, 2001) and Vector Space Model (VSP) (Salton et al.,
1975) for relatedness measurement?

In order to answer these two questions, we directly apply NGD and VSP with
tf-idf representations on the same KG that we use to learn the DSRM. Then
we combine them with the graph regularization model and study their impact
on entity disambiguation. Tables 8 and 9 show the relatedness quality and
disambiguation performance, respectively. As shown in the first three rows
of both tables, we can clearly see that NGD and VSP based on KGs
significantly outperform their variants with Wikipedia anchor links
(p < 0.05), which confirms that semantic KGs are better resources than
Wikipedia anchor links for relatedness measurement. This is because KGs
contain cleaner semantic knowledge about entities than Wikipedia anchor
links. For instance, “Apple Inc.” and “Barack Obama” share many noisy
incoming links (e.g., “Austin, Texas” and “2010s”) that are not helpful to
capture their relatedness.

Methods      nDCG@1    nDCG@5    nDCG@10    MAP
M&W          0.54      0.52      0.55       0.48
M&W_1234     0.69      0.58      0.58       0.51
VSP_1234     0.68      0.58      0.58       0.52
DSRM_1234    0.81      0.73      0.74       0.68

Table 8: Impact of semantic KGs and DNN on entity semantic relatedness.

             AIDA dataset                  Tweet set
Methods      micro P@1.0    macro P@1.0    micro P@1.0    macro P@1.0
M&W          82.23          81.10          65.13          66.20
M&W_1234     84.57          83.81          68.21          69.17
VSP_1234     84.83          83.49          69.36          70.20
DSRM_1234    86.58          85.47          71.89          71.72

Table 9: Impact of semantic KGs and DNN on entity disambiguation.

From the last three rows of Tables 8 and 9, we can see that the DSRM based
on DNN significantly outperforms NGD and VSP for both relatedness
measurement and entity disambiguation (p < 0.05), illustrating that the DNN
is indeed more effective to measure entity relatedness. By extracting
useful semantic features layer by layer with nonlinear functions and
transforming sparse binary “one-hot” vectors into low-dimensional feature
vectors in a latent space, the DNN has a better ability to represent
entities semantically.

6 Conclusions and Future Work 

We have introduced a deep semantic relatedness model based on DNN and
semantic KGs for entity relatedness measurement. By encoding various
semantic knowledge from KGs into DNN with multi-layer non-linear
transformations, the DSRM can extract useful semantic features to represent
entities. By developing an unsupervised graph regularization approach to
model topical coherence with the proposed DSRM, we have achieved
significantly better performance than state-of-the-art entity
disambiguation approaches. This work sheds light on exploring and modeling
large-scale KGs with deep learning techniques for entity disambiguation and
other NLP tasks. Our future work includes direct encoding of semantic paths
from KGs into neural networks. We also plan to design a joint model for
entity disambiguation and entity relatedness measurement, which allows
mutual improvement of both tasks and generates dynamic and context-aware
entity relatedness scores.